We analyze the learning of noisy teacher-generated examples by nonlinear and differentiable
student perceptrons using the cavity method. The generic activation of an example is a function of
the cavity activation of the example, which is its activation in the perceptron that learns without
the example. Mean field equations for the macroscopic parameters and the stability condition yield
results consistent with the replica method. When a single value of the cavity activation maps to
multiple values of the generic activation, there is a competition in learning strategy between
preferentially learning an example and sacrificing it in favor of the background adjustment. We find
parameter regimes in which examples are learned preferentially or sacrificially, leading to a gap
in the activation distribution. Full phase diagrams of this complex system are presented, and the
theory predicts the existence of a phase transition from poor to good generalization states in the
system. Simulation results confirm the theoretical predictions.
I. INTRODUCTION

Since Hopfield's pioneering work on neural networks [1], statistical mechanics has proved to be a powerful tool in the
study of information processing. Mean field theories such as the replica method [?] and the cavity method [?]
have been successfully developed to study these problems. In particular, they provide valuable insights into the learning of
examples in neural networks by treating it as an energy minimization process. Early work used the replica method
to study the learning problem in various situations [?]. The replica method has the advantage of a ready-to-use mathematical
formalism applicable to general cases, and has been applied to linear networks [12]-[15] and networks with binary
outputs [?], dealing with learning tasks that are either realizable or non-realizable, with random or teacher-generated
data, and with clean or noise-corrupted data. These studies mainly focused on the global properties of the
learning system, with less emphasis on the microscopic description of the examples and the weights in the system.
Furthermore, most of these models were still remote from the differentiable nonlinear perceptron, which is the one most
commonly used today. Other work used the Green's function approach, which is particularly convenient for linear
networks [22], but these systems may not have the competitive effects among examples found in nonlinear networks, which
will be investigated in this paper. The annealed approximation is suitable for analyzing high temperature learning
[?], but the results cannot be directly extended to the more common case of low temperature.

A common phenomenon observed in studies of learning from examples is the existence of phase transitions, with an
abrupt improvement in the generalization ability of the network once the training examples are sufficiently numerous
or the global parameters (e.g., the weight decay) are suitably tuned [23]-[27]. These transitions are often discontinuous.
They arise when metastable states are present in the system, leading to discontinuous jumps in the network states,
hysteresis, and the disappearance of metastability at spinodal points. Multilayer perceptrons exhibit a transition
from permutation-symmetric to specialized states [?]. In the present paper, we will see these effects in nonlinear
perceptrons learning noisy examples. Here the competition between the locally stable states comes from the different
learning strategies used to attain the systemwide energy minimum.

The cavity method is a suitable tool for studying information competition effects in rule extraction from noisy examples.
Large scale neural networks with many nodes can be considered as mean field systems since, as far as the learning of
one example is concerned, the influence of the rest of the examples can be regarded as a background satisfying some
average properties. The success of the mean field approach is illustrated by the capability of the replica method to
describe the macroscopic properties of neural network learning. However, the replica method provides much less
insight into the processing of individual examples, since its starting point is the quenched average of the free energy
over the example distribution. The cavity method is an alternative version of mean field theory. It is a generalization
of the Thouless-Anderson-Palmer (TAP) approach to spin glasses and starts from a microscopic description of the
system elements [28,?]. In this method, mean field equations are derived from self-consistent considerations. The
method was subsequently generalized to learning problems [?,29] and yields macroscopic properties identical to those of the
replica method, while at the same time providing physical insights into the learning of individual examples. Recently,
the cavity method was also applied to a number of problems in information processing [30].

In this paper, we study the learning of noisy examples in nonlinear perceptrons using the cavity method. Nonlinear
networks have the following advantages: (i) compared with networks with binary output, gradient descent learning is
possible; (ii) nonlinearity is representative of more complex networks; (iii) they bear a closer resemblance to biological
neurons [31]. Compared with previous studies, we focus on the effects of information competition in the system
and their consequences for the energy landscape, the appearance of band gaps in the activation distribution, the choice
between preferential and even-handed learning strategies, and their possible relationship with phase transitions
in this complex system. We analyze the parameter regimes with band gaps in the activation distribution, as well as
the stability condition of the perturbative cavity approach. Simulation results show that the assumption of a smooth
energy landscape usually works well when no gaps are present, but tends to fail when gaps appear. As in spin glass
models, the picture of a smooth energy landscape then has to be replaced by one with many metastable states that cannot
be related by perturbative analyses. The assumptions of smooth and rough energy landscapes are equivalent to the
replica symmetric (RS) and replica symmetry-breaking (RSB) approximations in the replica method. The phase
diagram of this complex system is shown, and the occurrence of phase transitions is investigated and compared with
simulations.



The rest of this paper is organized as follows. After describing the model in the next section, we describe in Sec. III
the cavity approach and introduce the cavity activation, which is the core microscopic variable in the cavity method.
Three self-consistent equations are derived when a smooth energy landscape is assumed. In Sec. IV, we discuss the
case when band gaps appear in the activation distribution. Phase transitions in nonlinear perceptrons and phase
diagrams are the themes of Sec. V. In Sec. VI we summarize the results and their implications. Mathematical details
are appended at the end of the paper.

II. THE MODEL 

Consider a student perceptron with $N$ weights $J_j$, $j = 1, \ldots, N$, that connect the $N$ input nodes to the output
node. It is trained to extract the rule of a teacher perceptron with the same architecture and $N$ weights $B_j$,
$j = 1, \ldots, N$, where $\langle B_j \rangle = 0$ and $\langle B_j^2 \rangle = 1$. The student has access to a training set of $p$ examples generated by the teacher and corrupted
by noise. Each example, labeled $\mu$ with $\mu = 1, \ldots, p$, consists of an input vector $\boldsymbol{\xi}^\mu$
and the noisy output $O_\mu$ of the teacher. The input components $\xi_j^\mu$ are Gaussian random variables, with $\langle \xi_j^\mu \rangle = 0$
and $\langle \xi_j^\mu \xi_k^\nu \rangle = \delta_{jk}\delta_{\mu\nu}$. The activation functions $f(x)$ of both perceptrons are differentiable and nonlinear, such as
$\mathrm{sig}(x) = (1 - \tanh x)/2$, i.e., the teacher and student outputs are respectively

$$O_\mu = f(\tilde{y}_\mu) = f(y_\mu + T\eta_\mu) \tag{1}$$

and

$$f_\mu = f(x_\mu), \tag{2}$$

where $y_\mu = \mathbf{B}\cdot\boldsymbol{\xi}^\mu/\sqrt{N}$ is the teacher activation, $\eta_\mu$ is Gaussian noise with $\langle \eta_\mu \rangle = 0$ and $\langle \eta_\mu^2 \rangle = 1$, $T$ is the noise
temperature, and $x_\mu = \mathbf{J}\cdot\boldsymbol{\xi}^\mu/\sqrt{N}$ is the student activation.
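As a concrete illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch of how a noisy training set of this kind could be drawn. The logistic form chosen for $f$, the function name `make_noisy_teacher_data`, and the sizes used are illustrative assumptions, not choices taken from the text.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, used here as one differentiable nonlinear choice for f (an assumption)."""
    return 1.0 / (1.0 + np.exp(-x))

def make_noisy_teacher_data(N, p, T, rng, f=sigmoid):
    """Draw a teacher B, p Gaussian input vectors, and the noisy teacher outputs O_mu."""
    B = rng.standard_normal(N)          # teacher weights with <B_j> = 0 and <B_j^2> = 1
    xi = rng.standard_normal((p, N))    # independent unit Gaussian input components
    y = xi @ B / np.sqrt(N)             # teacher activations y_mu
    eta = rng.standard_normal(p)        # unit Gaussian noise eta_mu
    O = f(y + T * eta)                  # Eq. (1): O_mu = f(y_mu + T * eta_mu)
    return B, xi, y, O

rng = np.random.default_rng(0)
B, xi, y, O = make_noisy_teacher_data(N=200, p=600, T=1.0, rng=rng)
```

Here `T` plays the role of the noise temperature: setting `T=0.0` gives clean teacher outputs.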

During the training procedure, one adapts the student network to minimize an energy function that measures
the difference between the student outputs $f_\mu$ and the teacher outputs $O_\mu$ for all training examples. A natural energy
function is the total quadratic error of the examples in the training set, $\sum_{\mu=1}^{p}(O_\mu - f_\mu)^2 = p\,\epsilon_t^2$, where we call $\epsilon_t$ the training error.
However, the final target of learning is to obtain a student perceptron that generalizes well to novel examples, i.e., to
minimize the generalization error $\epsilon_g = \langle [O(\boldsymbol{\xi}) - f(\boldsymbol{\xi})]^2 \rangle^{1/2}$, where $\langle\,\rangle$ is the average over the distribution
of all inputs and the noise. We add a weight decay term to penalize excessively long weight vectors and speed up
learning, and use the energy function

$$E = \frac{1}{2}\sum_\mu (O_\mu - f_\mu)^2 + \frac{\lambda}{2}\sum_j J_j^2, \tag{3}$$

where $\lambda$ is the weight decay strength. Minimizing the above energy function by gradient descent, one obtains the
equilibrium state of the student perceptron given by

$$J_j = \frac{1}{\lambda\sqrt{N}}\sum_\mu (O_\mu - f_\mu) f'_\mu\, \xi_j^\mu, \tag{4}$$
where the prime in $f'_\mu$ denotes the derivative of $f(x_\mu)$. Here we are interested in the dependence of the generalization
error $\epsilon_g$ of the student perceptron in its equilibrium state (4) on the macroscopic parameters, such as the weight decay
strength $\lambda$, the noise temperature $T$ and the size of the training set $\alpha = p/N$. As in perceptrons with linear or discrete
activation functions, the generalization error is essentially determined by the overlap $R$ of the student weight vector
with the teacher weight vector and the magnitude $q$ of the student weight vector, which are defined as

$$q = \langle J_j^2 \rangle_j \quad\text{and}\quad R = \langle J_j B_j \rangle_j, \tag{5}$$

where $\langle\,\rangle_j$ represents averaging over the $N$ weights.
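The equilibrium condition and the two macroscopic parameters can be checked on a small simulated system. The sketch below is only an illustration under stated assumptions: it uses a logistic $f$, plain fixed-step gradient descent on the weight-decay energy, and hypothetical names (`energy_gradient`, `train_student`) and parameter values that do not come from the text. After training, it compares the weights with the right-hand side of the equilibrium condition and evaluates $q$ and $R$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy_gradient(J, xi, O, lam):
    """Gradient of E = 0.5*sum_mu (O_mu - f_mu)^2 + 0.5*lam*sum_j J_j^2 with respect to J."""
    N = xi.shape[1]
    f = sigmoid(xi @ J / np.sqrt(N))
    fprime = f * (1.0 - f)                      # derivative of the logistic function
    return -(xi.T @ ((O - f) * fprime)) / np.sqrt(N) + lam * J

def train_student(xi, O, lam, lr=0.5, steps=20000):
    """Fixed-step gradient descent from J = 0 (step size and step count are illustrative)."""
    J = np.zeros(xi.shape[1])
    for _ in range(steps):
        J -= lr * energy_gradient(J, xi, O, lam)
    return J

rng = np.random.default_rng(1)
N, p, T, lam = 50, 150, 0.5, 0.2                # illustrative sizes and parameters
B = rng.standard_normal(N)
xi = rng.standard_normal((p, N))
O = sigmoid(xi @ B / np.sqrt(N) + T * rng.standard_normal(p))

J = train_student(xi, O, lam)
f = sigmoid(xi @ J / np.sqrt(N))
# Right-hand side of the equilibrium condition, evaluated at the trained weights.
J_fixed_point = (xi.T @ ((O - f) * f * (1.0 - f))) / (lam * np.sqrt(N))
q = np.mean(J**2)                               # magnitude of the student weight vector
R = np.mean(J * B)                              # overlap with the teacher weight vector
```

At a stationary point of the energy, `J` and `J_fixed_point` coincide up to the residual gradient divided by `lam`.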

III. THE CAVITY METHOD 

In order to gain a more microscopic understanding of the learning mechanism in neural networks, we use the
cavity method developed in [?,29] to tackle the current problem. After the student perceptron is trained with $p$
examples, it reaches its energy ground state $\mathbf{J}$ given by Eq. (4). Suppose a new example 0 with input vector $\boldsymbol{\xi}^0$ is fed
to the student perceptron. The activation of example 0 is now given by

$$t_0 = \mathbf{J}\cdot\boldsymbol{\xi}^0/\sqrt{N}, \tag{6}$$

which is called the cavity activation. Since the student $\mathbf{J}$ has no information about an example it has never learned,
the cavity activation $t_0$ is a Gaussian variable for random inputs $\boldsymbol{\xi}^0$ when $N \gg 1$. It has mean $\langle\langle t_0 \rangle\rangle = 0$ and
covariances $\langle\langle t_0^2 \rangle\rangle = q$ and $\langle\langle t_0 y_0 \rangle\rangle = R$, where $\langle\langle\,\rangle\rangle$ denotes the ensemble average. Hence the distribution of the cavity
field is

$$P(t_0|y_0) = \frac{1}{\sqrt{2\pi(q - R^2)}}\exp\left[-\frac{(t_0 - R y_0)^2}{2(q - R^2)}\right]. \tag{7}$$

Trained with all the $p + 1$ examples $\{(\boldsymbol{\xi}^\mu, O_\mu)\,|\,\mu = 0, 1, \ldots, p\}$, the student perceptron reaches its equilibrium state
$\mathbf{J}^0$, with

$$J_j^0 = \frac{1}{\lambda\sqrt{N}}(O_0 - f_0^0)(f_0^0)'\,\xi_j^0 + \frac{1}{\lambda\sqrt{N}}\sum_\mu (O_\mu - f_\mu^0)(f_\mu^0)'\,\xi_j^\mu. \tag{8}$$

Here and below, variables with superscript 0 refer to those associated with the perceptron $\mathbf{J}^0$, which includes example 0
in its training set. We see that the generic student activation of example 0, $x_0 = \mathbf{J}^0\cdot\boldsymbol{\xi}^0/\sqrt{N}$, is no longer a Gaussian
variable. (Although the correct notation for $x_0$ should be $x_0^0$, here we omit the superscript since it is sufficiently distinct
from its cavity counterpart $t_0$.) However, it is reasonable to assume that the difference between $\mathbf{J}$ and $\mathbf{J}^0$ is small; the
validity of this assumption will be discussed later. Following the perturbative analysis in [6], we show in Appendix A
that, for a given corrupted teacher output $\tilde{y}_0$, there is a well defined relation between $t_0$ and $x_0$, $t_0 = t(x_0, \tilde{y}_0)$, where

$$t(x,\tilde{y}) = x - \gamma[f(\tilde{y}) - f(x)]f'(x). \tag{9}$$

Here the parameter $\gamma$ is the local susceptibility and satisfies

$$1 - \gamma\lambda = \alpha\left\langle 1 - \frac{\partial x_\mu}{\partial t_\mu}\right\rangle_\mu, \tag{10}$$

where $x_\mu$ is a single-valued function of $t_\mu$, and $\langle\,\rangle_\mu$ represents averaging over the $p$ examples. In this section we
focus on the case in which Eq. (9) gives a one-to-one mapping between $x_\mu$ and $t_\mu$ for a given $\tilde{y}_\mu$. As we shall see, this
corresponds to a continuous activation distribution with no band gaps. In the next section we discuss the case
in which $t_\mu$ has a one-to-many relation with $x_\mu$, which leads to the emergence of band gaps.
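In the one-to-one regime, the relation between the cavity and generic activations can be inverted numerically. The following sketch assumes a logistic $f$ and a value of $\gamma$ small enough that $t(x,\tilde{y})$ increases monotonically in $x$; the bisection scheme, the bracket $t \pm \gamma/4$ (which relies on $|[f(\tilde{y}) - f(x)]f'(x)| \le 1/4$ for the logistic function), and all function names are illustrative assumptions.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def fp(x):
    s = f(x)
    return s * (1.0 - s)                      # f'(x) for the logistic function

def fpp(x):
    s = f(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)    # f''(x) for the logistic function

def cavity_t(x, y_noisy, gamma):
    """Cavity activation as a function of the generic activation: t = x - gamma*[f(y)-f(x)]*f'(x)."""
    return x - gamma * (f(y_noisy) - f(x)) * fp(x)

def generic_x(t, y_noisy, gamma, iters=60):
    """Invert cavity_t by bisection; valid only while cavity_t is increasing in x."""
    lo = t - gamma / 4.0
    hi = t + gamma / 4.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_low = cavity_t(mid, y_noisy, gamma) < t
        lo = np.where(too_low, mid, lo)
        hi = np.where(too_low, hi, mid)
    return 0.5 * (lo + hi)

def dx_dt(x, y_noisy, gamma):
    """Slope of x(t), obtained by differentiating the relation t(x, y) with respect to x."""
    return 1.0 / (1.0 + gamma * (fp(x) ** 2 - (f(y_noisy) - f(x)) * fpp(x)))

t = np.linspace(-3.0, 3.0, 7)
x = generic_x(t, y_noisy=1.0, gamma=5.0)
```

In this example each `x` lies between the corresponding `t` and the noisy teacher activation, reflecting the pull of the example towards the teacher output.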
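Combining Eqs. (7) and (9), we can derive the student activation distribution $P(x|y,\tilde{y})$,

$$P(x|y,\tilde{y}) = P(t(x,\tilde{y})|y)\,\frac{\partial t(x,\tilde{y})}{\partial x}. \tag{11}$$

In turn, the distributions $P(\tilde{y}|y)$ and $P(y)$ are given by

$$P(\tilde{y}|y) = \frac{1}{\sqrt{2\pi T^2}}\exp\left[-\frac{(\tilde{y} - y)^2}{2T^2}\right], \tag{12}$$

$$P(y) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right). \tag{13}$$

Equation (10) for $\gamma$ can now be transformed to an integral expression when $N$ approaches infinity,

$$1 - \gamma\lambda = \alpha\int dy\,P(y)\int d\tilde{y}\,P(\tilde{y}|y)\int dx\,P(x|y,\tilde{y})\left(1 - \frac{\partial x}{\partial t}\right), \tag{14}$$

where $\partial x/\partial t = (1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\})^{-1}$. Equation (14) can be simplified into an equation involving
only double integrals,

$$1 - \gamma\lambda = \alpha\int Du\int Dv\,\left[1 - \left(1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\}\right)^{-1}\right], \tag{15}$$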



where $Du \equiv du\,\exp(-u^2/2)/\sqrt{2\pi}$ and $Dv \equiv dv\,\exp(-v^2/2)/\sqrt{2\pi}$ are Gaussian measures, $\tilde{y} = \sqrt{1+T^2}\,u$, and $x$
depends on $u$ and $v$ via

$$\frac{R}{\sqrt{1+T^2}}\,u + \sqrt{q - \frac{R^2}{1+T^2}}\;v = t(x,\tilde{y}). \tag{16}$$
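The change of variables to the two independent Gaussian variables $u$ and $v$ can be sanity-checked numerically: the cavity activation built from them should have second moment $q$ and overlap $R$ with the noisy teacher activation. The sketch below uses Gauss-Hermite quadrature from NumPy; the function names and the values of $R$, $q$ and $T$ are illustrative assumptions and are not solutions of the mean field equations.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_grid(n=40):
    """Nodes and normalised weights for the product Gaussian measure Du Dv."""
    z, w = hermegauss(n)                  # quadrature for the weight function exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)          # normalise so that the weights sum to 1
    U, V = np.meshgrid(z, z, indexing="ij")
    return U, V, np.outer(w, w)

def cavity_from_uv(U, V, R, q, T):
    """Noisy teacher activation and cavity activation expressed through u and v."""
    y_noisy = np.sqrt(1.0 + T**2) * U
    t = R * U / np.sqrt(1.0 + T**2) + np.sqrt(q - R**2 / (1.0 + T**2)) * V
    return y_noisy, t

U, V, W = gaussian_grid()
R, q, T = 0.6, 0.8, 1.5                   # illustrative values only
y_noisy, t = cavity_from_uv(U, V, R, q, T)
second_moment_t = np.sum(W * t**2)        # should reproduce q
overlap_t_y = np.sum(W * t * y_noisy)     # should reproduce R
```

Because the integrands here are low-order polynomials in $u$ and $v$, the quadrature reproduces both moments to rounding accuracy.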

The mean field equation for $R$ can be obtained by multiplying both sides of Eq. (4) by $B_j$ and summing over $j$,
yielding

$$R = \frac{\alpha}{\lambda}\int dy\,P(y)\int d\tilde{y}\,P(\tilde{y}|y)\int dx\,P(x|y,\tilde{y})\,[f(\tilde{y}) - f(x)]f'(x)\,y. \tag{17}$$

Substituting Eqs. (11)-(13), we can simplify it to

$$R = \frac{\alpha}{\lambda}\int Du\int Dv\,[f(\sqrt{1+T^2}\,u) - f(x)]f'(x)\left[\frac{u}{\sqrt{1+T^2}} + \frac{RT^2\,v}{\sqrt{(1+T^2)[(1+T^2)q - R^2]}}\right], \tag{18}$$

where $x$ depends on $u$ and $v$ via Eq. (16). Integrating by parts and using Eqs. (15) and (16), we have

$$R = \alpha\gamma\int Du\int Dv\,\frac{f'(\tilde{y})f'(x)}{1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\}}. \tag{19}$$

Similarly, multiplying both sides of Eq. (4) by $J_j$ and summing over $j$, we have

$$q = \frac{\alpha}{\lambda}\int dy\,P(y)\int d\tilde{y}\,P(\tilde{y}|y)\int dx\,P(x|y,\tilde{y})\,[f(\tilde{y}) - f(x)]f'(x)\,x. \tag{20}$$

Again, simplifying into double integrals and integrating by parts, we arrive at

$$q - R^2 = \alpha\gamma^2\int Du\int Dv\,[f(\tilde{y}) - f(x)]^2[f'(x)]^2. \tag{21}$$



The three macroscopic parameters $\gamma$, $R$ and $q$ can now be obtained by solving the three mean field equations (15),
(19) and (21) numerically for given values of $\alpha$, $\lambda$ and $T$. We can then directly obtain the training error $\epsilon_t$ and the
generalization error $\epsilon_g$, which depend on the generic activation $x$ and the cavity activation $t$ respectively,

$$\epsilon_t^2 = \int Du\int Dv\,[f(\tilde{y}) - f(x)]^2, \tag{22}$$

$$\epsilon_g^2 = \int Du\int Dv\,[f(\tilde{y}) - f(t)]^2. \tag{23}$$
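The double Gaussian integrals above lend themselves to Gauss-Hermite quadrature. The sketch below evaluates the training and generalization errors, the right-hand side of the equation for $\gamma$, and the quantity $\alpha\langle(1 - \partial x/\partial t)^2\rangle$ for given trial values of $(\gamma, R, q)$. It assumes a logistic $f$ and the one-to-one regime, inverts the cavity relation by bisection, and uses hypothetical names and trial values throughout; it is not a solver for the self-consistent equations.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def fp(x):
    s = f(x)
    return s * (1.0 - s)

def fpp(x):
    s = f(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def generic_x(t, y_noisy, gamma, iters=60):
    """Solve t = x - gamma*[f(y)-f(x)]*f'(x) for x by bisection (one-to-one regime only)."""
    lo, hi = t - gamma / 4.0, t + gamma / 4.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_low = mid - gamma * (f(y_noisy) - f(mid)) * fp(mid) < t
        lo = np.where(too_low, mid, lo)
        hi = np.where(too_low, hi, mid)
    return 0.5 * (lo + hi)

def mean_field_observables(gamma, R, q, T, alpha, n=60):
    """Quadrature estimates for trial (gamma, R, q); the inputs need not be self-consistent."""
    z, w = hermegauss(n)
    w = w / np.sqrt(2.0 * np.pi)
    U, V = np.meshgrid(z, z, indexing="ij")
    W = np.outer(w, w)
    y_noisy = np.sqrt(1.0 + T**2) * U
    t = R * U / np.sqrt(1.0 + T**2) + np.sqrt(q - R**2 / (1.0 + T**2)) * V
    x = generic_x(t, y_noisy, gamma)
    dxdt = 1.0 / (1.0 + gamma * (fp(x) ** 2 - (f(y_noisy) - f(x)) * fpp(x)))
    return {
        "eps_t": np.sqrt(np.sum(W * (f(y_noisy) - f(x)) ** 2)),   # training error
        "eps_g": np.sqrt(np.sum(W * (f(y_noisy) - f(t)) ** 2)),   # generalization error
        "rhs_gamma": alpha * np.sum(W * (1.0 - dxdt)),            # compare with 1 - gamma*lambda
        "stability": alpha * np.sum(W * (1.0 - dxdt) ** 2),       # alpha * <(1 - dx/dt)^2>
    }

obs = mean_field_observables(gamma=5.0, R=0.5, q=0.5, T=1.0, alpha=3.0)
obs_no_response = mean_field_observables(gamma=0.0, R=0.5, q=0.5, T=1.0, alpha=3.0)
```

A full solution would wrap such a function in a root finder over $(\gamma, R, q)$; with $\gamma = 0$ the generic and cavity activations coincide and the two errors are equal.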

The validity of the perturbative calculation can be checked by considering the stability condition of the equilibrium
state. As derived in Appendix B, when the new example is added, the magnitude of the change in the student
weight vector is given by

$$\Delta \equiv \sum_j (J_j^0 - J_j)^2 = \frac{(x_0 - t_0)^2}{1 - \alpha\left\langle\left(1 - \frac{\partial x_\mu}{\partial t_\mu}\right)^2\right\rangle_\mu}. \tag{24}$$

Hence $\Delta$ diverges when the denominator approaches 0. This yields the stability condition

$$\alpha\left\langle\left(1 - \frac{\partial x_\mu}{\partial t_\mu}\right)^2\right\rangle_\mu < 1. \tag{25}$$

It is identical to the stability condition of the replica-symmetric (RS) ansatz in the replica approach [?], the so-called
Almeida-Thouless (AT) condition [32].

In the region where the stability condition (25) is violated, the perturbative version of the cavity method breaks down.
It becomes possible that, when a new example is added to the system, the ground state relocates to another metastable
state. This corresponds to the picture of a rough energy landscape with many metastable states, and the perturbative
cavity method has to be modified [29].

IV. ACTIVATION DISTRIBUTIONS WITH BAND GAPS 

When the activation function $f(x)$ is nonlinear, the behavior of the system may be very complex. This can be seen
by considering Eq. (9) for a sufficiently large $\gamma$, when the generic activation $x$ may become a multi-valued function
of the cavity activation $t$. In such a case, the system settles at its ground state, i.e., chooses the value of $x$ that
minimizes the energy function in Eq. (3).

The energy increase on adding example 0 can be derived easily. According to Eq. (3), the energy difference between
the perceptron states $\mathbf{J}^0$ and $\mathbf{J}$ is

$$\Delta E \equiv E^0 - E = \frac{1}{2}(O_0 - f_0)^2 + \frac{1}{2}\sum_\mu\left[(O_\mu - f_\mu^0)^2 - (O_\mu - f_\mu)^2\right] + \frac{\lambda}{2}\sum_j\left[(J_j^0)^2 - J_j^2\right]. \tag{26}$$

Expanding the first summation to second order in $(x_\mu^0 - x_\mu)$ and substituting Eq. (4) and Eq. (8) into the second
summation, we can simplify the above equation to

$$\Delta E = \frac{1}{2}(O_0 - f_0)^2 + \frac{1}{2}(O_0 - f_0)f'_0(x_0 - t_0). \tag{27}$$

Using the relation between the cavity activation $t_0$ and the generic activation $x_0$ in Eq. (9), we find

$$\Delta E = \frac{1}{2}(O_0 - f_0)^2 + \frac{1}{2\gamma}(x_0 - t_0)^2. \tag{28}$$

The first term is the primary change due to the newly added example, and the second term results from the adjustment
of the background examples. In the multi-valued region, one needs to compare the energy increase of the solutions
whose values of $x_0$ are closer to $t_0$ (and therefore favor a small background adjustment) with those whose outputs
$f_0$ are closer to the teacher's outputs $O_0$ (and therefore favor a small primary cost). This competition leads to
a discontinuity in the range of the generic activation $x_0$ when the cavity activation $t_0$ varies, accompanied by the
appearance of gaps in the activation distribution for a given teacher output.

To study this competition, we suppose that Eq. (9) has multiple solutions of $x$ in a range of $t$, for a given $\tilde{y}$. We are
interested in the point $t_g(\tilde{y})$ where two solutions yield the same energy change $\Delta E$. That is, there are two distinct
values of $x$, $x_<$ and $x_>$, such that $t_g = t(x_<,\tilde{y}) = t(x_>,\tilde{y})$ and $\Delta E(x_<) = \Delta E(x_>)$. Then, using Eq. (28), we arrive
at the condition

$$\int_{x_<(t_g)}^{x_>(t_g)} t(x)\,dx = t_g\,[x_>(t_g) - x_<(t_g)], \tag{29}$$

which is Maxwell's construction, as shown in Fig. 1. As a result of energy minimization, one of the two solutions
of $x$ is preferred in the left neighborhood of $t_g$, while the other is preferred in the right. Hence $x$ is a function of $t$
with a discontinuity at $t_g$.
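The discontinuity can be located numerically without solving the equal-area condition explicitly, by minimizing the energy change $\Delta E = \frac{1}{2}[f(\tilde{y}) - f(x)]^2 + (x - t)^2/(2\gamma)$ over $x$ for each $t$ and looking for a jump in the minimizer. The sketch below does this by brute force on a grid, assuming a logistic $f$; the grid ranges, step sizes, function names, and the two test values of $\tilde{y}$ are illustrative assumptions.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def ground_state_x(t, y_noisy, gamma, x_grid):
    """For each t, the x on the grid that minimises 0.5*[f(y)-f(x)]^2 + (x-t)^2/(2*gamma)."""
    primary = 0.5 * (f(y_noisy) - f(x_grid)) ** 2                      # cost of the new example
    background = (x_grid[None, :] - t[:, None]) ** 2 / (2.0 * gamma)   # background adjustment
    return x_grid[np.argmin(primary[None, :] + background, axis=1)]

def largest_jump(y_noisy, gamma):
    """Return (t, x just before, x just after) at the largest jump of the ground-state x(t)."""
    t = np.arange(-4.0, 8.0, 0.02)
    x_grid = np.arange(-12.0, 14.0, 0.005)
    x = ground_state_x(t, y_noisy, gamma, x_grid)
    k = int(np.argmax(np.diff(x)))
    return t[k], x[k], x[k + 1]

t_g, x_below, x_above = largest_jump(y_noisy=-3.0, gamma=20.55)   # extreme teacher output
_, x_a, x_b = largest_jump(y_noisy=0.0, gamma=20.55)              # intermediate teacher output
```

For the extreme teacher output the minimizer jumps by more than one unit of activation near $t \approx 3.8$, whereas for the intermediate teacher output the largest step is only of the order of the grid spacing.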

Consequently, the student activation distribution $P(x|\tilde{y})$ for a given teacher output $\tilde{y}$ becomes zero when the student
activation $x$ is located in the band gap $[x_<(t_g), x_>(t_g)]$. Due to the effect of band gaps in $P(x|\tilde{y})$, extra terms should
be added to the mean field equations Eq. (15) and Eq. (19) for $\gamma$ and $R$, as derived in Appendix C, namely,

$$1 - \gamma\lambda = \alpha\int Du\sum_i\int_{R_i} Dv\,\left[1 - \left(1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\}\right)^{-1}\right] - \alpha\int Du\sum_j G(t_g^j)\,[x_>(t_g^j) - x_<(t_g^j)], \tag{30}$$

$$R = \alpha\gamma\int Du\sum_i\int_{R_i} Dv\,\frac{f'(\tilde{y})f'(x)}{1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\}} + \alpha\gamma\int Du\sum_j G(t_g^j)\,f'(\tilde{y})\,[f(x_>(t_g^j)) - f(x_<(t_g^j))], \tag{31}$$

where each term in the summations over $i$ corresponds to an integration over a region $R_i$, the regions being separated from each other
by band gaps, and each term in the summations over $j$ corresponds to a band gap. The Gaussian factor $G(t_g^j)$ is given
by

$$G(t_g^j) = \frac{1}{\sqrt{2\pi\left(q - \frac{R^2}{1+T^2}\right)}}\exp\left[-\frac{\left(t_g^j - \frac{R}{\sqrt{1+T^2}}u\right)^2}{2\left(q - \frac{R^2}{1+T^2}\right)}\right]. \tag{32}$$

We note that the extra terms due to gaps are consistent with adding the delta function component $(x_> - x_<)\delta(t - t_g)$
to $\partial x/\partial t$ in Eq. (15) and $[f(x_>) - f(x_<)]\delta(t - t_g)$ to $f'(x)\,\partial x/\partial t$ in Eq. (19).

For the sigmoid function $f(x) = (1 + e^{-x})^{-1}$, the necessary and sufficient condition for Maxwell's construction, as
derived in Appendix D, is

$$f(\tilde{y}) - \frac{1}{2} > W(\gamma) \quad\text{for } x < 0, \qquad \frac{1}{2} - f(\tilde{y}) > W(\gamma) \quad\text{for } x > 0, \tag{33}$$

where the function $W(\gamma)$ is monotonic, as shown in Fig. 2. The behavior of the activation distribution depends on
the value of $\gamma$ in the following three cases:

Case 1: $\gamma < (117 + 165\sqrt{33})/64 \approx 16.64$. As $W(\gamma) > 1/2$ and $0 < f(\tilde{y}) < 1$, the condition (33) cannot be satisfied
for any teacher output $f(\tilde{y})$. Hence there is no gap in the activation distribution.

Case 2: $16.64 < \gamma < 48$. Here $0 < W(\gamma) < 1/2$. The activation distribution starts to develop a band gap which
extends from $f(\tilde{y}) = 1$ to $f(\tilde{y}) = 1/2 + W(\gamma)$ in the region $f(x) < 1/2$. Similarly, another band gap extends from
$f(\tilde{y}) = 0$ to $f(\tilde{y}) = 1/2 - W(\gamma)$ in the region $f(x) > 1/2$. The two band gaps are symmetric with respect to the point
$(f(x), f(\tilde{y})) = (1/2, 1/2)$. For intermediate teacher outputs between $1/2 \pm W(\gamma)$, the distribution remains continuous.
The gapped regions are bounded by solid lines in Fig. 3.

Case 3: $\gamma > 48$. Here $W(\gamma) < 0$. The band gap in the region $f(x) < 1/2$ now extends from $f(\tilde{y}) = 1$ to
$f(\tilde{y}) = 1/2 + W(\gamma) < 1/2$. Together with its symmetric counterpart in the region $f(x) > 1/2$, the activation
distribution is three-banded for $1/2 + W(\gamma) < f(\tilde{y}) < 1/2 - W(\gamma)$, beyond which the activation distribution remains
two-banded.
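The two thresholds quoted in the cases above can be cross-checked numerically from the cavity relation alone. For the logistic function, $f' = f(1-f)$ and $f'' = f'(1-2f)$, so the slope $\partial t/\partial x = 1 + \gamma\{[f'(x)]^2 - [f(\tilde{y}) - f(x)]f''(x)\}$ can be written in terms of the student output $s = f(x)$ and the teacher output $a = f(\tilde{y})$. The sketch below finds, for a given $a$, the smallest $\gamma$ at which this slope first reaches zero, which is taken here as a proxy for the onset of multi-valuedness; the function name and grid size are illustrative assumptions.

```python
import numpy as np

def onset_gamma(a, n=200001):
    """Smallest gamma at which dt/dx first touches zero for teacher output a = f(y)."""
    s = np.linspace(0.0, 1.0, n)                 # student output s = f(x)
    p = s * (1.0 - s)                            # f'(x) for the logistic function
    D = p * (p + (s - a) * (1.0 - 2.0 * s))      # [f']^2 - [f(y) - f(x)] f''
    d_min = D.min()
    return np.inf if d_min >= 0.0 else -1.0 / d_min

gamma_extreme = onset_gamma(0.0)    # teacher output at the saturated end
gamma_middle = onset_gamma(0.5)     # teacher output at the midpoint
gamma_case1 = (117.0 + 165.0 * np.sqrt(33.0)) / 64.0
```

With these definitions `gamma_extreme` agrees with $(117 + 165\sqrt{33})/64 \approx 16.64$ and `gamma_middle` with 48, consistent with gaps appearing first at extreme teacher outputs and reaching $f(\tilde{y}) = 1/2$ only for $\gamma > 48$.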

When a band gap is present for a given $f(\tilde{y})$, multiple energy minima can exist for a generic student activation $x$
near the gap. When the energy minimum favors the generic activation taking a value closer to the teacher activation
than the cavity activation, the example is preferentially learned. Otherwise, when the generic activation is closer
to the cavity activation, the example is sacrificed. As shown in Fig. 3, a band gap exists in the regions that are
shaded or enclosed by the transition lines $L_s^{tr}$ and $L_p^{tr}$ (subscripts $s$ and $p$ represent sacrificed and preferred states
respectively). In the neighborhood of the sacrificed band edge, the line $L_s^{c}$ indicates the onset of competition. Between
the lines $L_s^{c}$ and $L_s^{tr}$, the sacrificed state competes with a metastable preferred state, which appears between the
spinodal line $L_p^{sp}$ and the line $L_p^{tr}$. As illustrated in Fig. 1, the stable states between points $P_<^{c}$ and $P_<^{tr}$ compete
with metastable states between points $P_>^{sp}$ and $P_>^{tr}$, but the sacrificed states remain the ground states. Between the line
$L_s^{tr}$ and the spinodal line $L_s^{sp}$, the sacrificed state becomes metastable. It disappears at $L_s^{sp}$. Similar lines exist in the
neighborhood of the preferred band edge.

The band gaps and preferential learning shown in Fig. 3 are for $\gamma = 20.55$. In the ground state,
no examples exist in the shaded region, which gives rise to a gap in the activation distribution. One finds that preferential
learning first occurs at extreme values of the teacher output, $f(\tilde{y}) > f(y^+) = 0.864$ or $f(\tilde{y}) < f(y^-) = 0.136$. For
$f(\tilde{y}) < 0.136$, student activations to the left of the shaded region correspond to the preferred examples, whereas
those to the right correspond to the sacrificed ones. The energy advantage of this learning strategy can be easily
understood. In nonlinear perceptrons, changes in the student activation around these extreme values of $f(\tilde{y})$ do not
result in significant changes in the training error of an example, owing to the saturation in this region. If the cavity
activation is very different from the teacher's activation, it is therefore more economical to keep the student activation close to
the cavity activation, so that the background adjustment remains small. In contrast, for intermediate values of $f(\tilde{y})$,
the competitive effects are weaker, and no band gaps develop.

The width of the band gap can be narrowed when the existence of metastable states is taken into account. As
shown in Fig. 3, metastable states exist inside the band gap as far as the spinodal lines $L_s^{sp}$ and $L_p^{sp}$. Hence in
finite-time simulations, the system may be trapped in metastable states. Conventionally, the narrowing of band gaps
in simulations is explained by RSB effects in the replica method [?]. Here we conclude that the narrowing can
be explained by metastability in the perturbative cavity method, which is equivalent to the replica symmetric ansatz,
without invoking the formalism of RSB.

For comparison, this kind of preferential learning is not present in linear perceptrons, even when perfect learning
is impossible. Since $f(x) = k(x + \theta)$, the activation $x$ becomes a linear function of the cavity activation $t$ by virtue
of Eq. (9). Hence preferential learning is a unique consequence of the nonlinearity of the perceptron activation.

Figure 4 shows the parameter regimes for the existence of gapped activation distributions as well as the unstable
regimes of the perturbative cavity method (the boundary line being equivalent to the AT line in the replica method)
for different noise temperatures. Since the development of a gap is already sufficient to cause an uncontrollable change
in Eq. (24), the gapped regions lie inside the unstable regions. Furthermore, provided that $\alpha$ and $T$ are not too large, the
boundaries of the gapped and unstable regions are very close to each other. The region of small weight decay and
large noise will be discussed in the next section, where the phase lines are modified when discontinuous transitions
take place.

Figure 4 shows that band gaps in the activation distribution exist when the training set size $\alpha$ is small, the
noise temperature $T$ is large and the weight decay strength $\lambda$ is insufficient, leading to the preferential learning of some
examples while sacrificing others. When the training examples are sufficient, the underlying rule can be extracted
with confidence, thereby restoring the continuous distribution. Furthermore, increasing the data noise broadens the
gapped region. Indeed, noisy data introduce conflicting information to be learned by the student. On the other hand,
the gapped region narrows with increasing weight decay strength. Arguably, weight decay restricts the flexibility in
the weight space, thus reducing the tendency towards multiple minima.

We check the appearance of band gaps predicted by our theory with simulations. Four typical activation distributions,
all with $\alpha = 3$ and $\lambda = 0.002$, are plotted in Figs. 5(a-d) at increasing $T$. To facilitate comparison, examples
are collected for the noise-corrupted teacher activation $\tilde{y}$ at the value of $-\sqrt{1 + T^2}$, so that the probabilities $P(\tilde{y})$ are
the same.

Figure 5(a) is the distribution at $T = 0.1$, where $\gamma = 11.1$ and the stability condition (25) is fulfilled. The student
activation distribution in this case has a single band and is sharply peaked at $x = \tilde{y}$. When the noise temperature $T$
increases to 2, where $\gamma = 14.9$, the location of the parameters $\alpha = 3$ and $\lambda = 0.002$ in Fig. 4 is slightly above the
boundary between the gapped and continuous regimes. Correspondingly, a pseudogap develops in the activation
distribution, as shown in Fig. 5(b). Comparing with simulation results, we see that the assumption of a smooth
energy landscape used in the present work is valid in this regime. As shown in Table I, the theoretical and simulation
results of the training error $\epsilon_t$, the generalization error $\epsilon_g$, the weight overlap $R$ between teacher and student and the
student weight magnitude $q$ also agree well.

When $T = 2.5$, where $\gamma = 20.6$, the stability condition is violated and there is now a gap in both the prediction of
the cavity method and the simulation result, as shown in Fig. 5(c). However, we can see from Fig. 5(d), where $T = 5$,
that the theoretically predicted band gap is broader and has sharper edges than the simulated one. At the same time,
as shown in Table I, there are prominent differences in $\epsilon_t$, $R$ and, especially, $q$. Two arguments are relevant. First,
the narrowing of the band gap can be explained by the presence of metastable states in the band gap, as discussed
in connection with Fig. 3. These metastable states probably prevent the learning process from converging to the ground state, which
therefore yields a value of $q$ different from the theory. Secondly, due to the violation of the stability condition (25)
when the band gap develops, a rough energy landscape as discussed previously [29] must be introduced to improve
the agreement. It is hoped that the first step replica symmetry-breaking ansatz will predict a result more consistent
with the simulation, such as a shallower band gap and a smaller value of $q$. For exact solutions, it was pointed out
recently that whenever there is a gap in the activation distribution, a full RSB analysis is necessary [21].

In Fig. 6, we plot the activation distributions predicted theoretically for different $\alpha$ with the same noise temperature
$T = 5$ and weight decay strength $\lambda = 0.001$. While Fig. 4 shows that insufficient examples cause the
appearance of band gaps, here one finds it is possible that the fraction of examples located in the sacrificed band
decreases with the size of the example set. Therefore, the competitive effects of learning strategies are serious only
when both noise and the size of training set are large, as one may expect intuitively.



V. PHASE TRANSITIONS 

Another consequence of nonlinearity is the existence of two stable solutions of $R$, $q$ and $\gamma$ to the mean field equations
(30), (31) and (21) for a given set of parameters. We plot the curves of $\lambda$ versus $\gamma$ in Fig. 7 for different values of $\alpha$ at
a given noise temperature $T$. Studying the behavior of the curves, and hence the accompanying phase transition in
different ranges of $\alpha$, we find two critical parameters $\alpha^*(T)$ and $\alpha_0(T)$ for a given noise temperature $T$. [$\alpha^*(1) = 1.65$,
$\alpha_0(1) = 1.737$.]

Case 1: $\alpha < \alpha^*(T)$. $\lambda$ is a monotonically decreasing function of $\gamma$. Hence for any weight decay strength, there is
a unique local susceptibility. Numerical results in this region show that the magnitude $q$ of the student weight vector
increases with decreasing weight decay $\lambda$.

Case 2: $\alpha^*(T) < \alpha < \alpha_0(T)$. At $\alpha = \alpha^*(T)$, multiple solutions of $\gamma$ for a given $\lambda$ start to appear near the inflection
point of the curve. The solution with the smallest $\gamma$ corresponds to the good generalization solution, with small $q$
and small $\epsilon_g$. The solution with the largest $\gamma$ corresponds to the poor generalization solution, with large $q$ and $\epsilon_g$. In
between the two solutions, there is a third, unstable, solution, which can be considered as the barrier separating the
two stable solutions in the energy landscape. When $\alpha$ increases beyond $\alpha^*(T)$, the intermediate range of $\lambda$ for which
multiple solutions exist becomes increasingly wide.

At very large $\lambda$, the good generalization state is the only stable solution. When $\lambda$ decreases, a metastable state
with poor generalization appears at the spinodal point $\lambda_p(\alpha,T)$. When $\lambda$ decreases further, the globally stable state
switches from the good generalization state to the poor one in a discontinuous phase transition at $\lambda_c(\alpha,T)$. The point
where this first-order transition occurs has to be determined by comparing the energies of the two states. On further
decrease of $\lambda$, the metastable state of good generalization disappears at another spinodal point $\lambda_g(\alpha,T)$. Hence $\alpha^*(T)$
is a critical point where the discontinuous transition first appears.

Case 3: $\alpha > \alpha_0(T)$. At $\alpha = \alpha_0(T)$, the spinodal point $\lambda_g$ of the good generalization state vanishes. Hence both
poor generalization and good generalization solutions coexist for $\lambda$ below $\lambda_p$ down to zero. Here the example set is
large enough to provide information about the teacher such that the good generalization solution exists even in the
absence of weight decay, although it is only metastable.

The dependence of the generalization error $\epsilon_g$ on the weight decay strength $\lambda$ is shown in Fig. 8 for different sizes
$\alpha$ of the training set. When $\alpha < \alpha^*(T)$, there is no phase transition, and the generalization error decreases continuously
on increasing $\lambda$ up to the optimal weight decay strength $\lambda_{opt}(T)$ ($\approx 0.05$ for $T = 1$), where the perceptron generalizes
best. Similar phenomena are also found in linear perceptrons learning noisy examples and constrained with weight
decay [15]. However, the situation in nonlinear perceptrons learning noisy examples becomes more complex for larger
$\alpha$. When $\alpha > \alpha^*(T)$, the energy curve has two stable branches that cross at $\lambda_c(\alpha,T)$ (as illustrated in the inset of
Fig. 8). Here the thermodynamic transition of $\epsilon_g$ takes place discontinuously. In the range of $\lambda$ around $\lambda_c$, metastable
states separated by an energy barrier exist. Hence in practice, the transition between the good and poor generalization
states may not take place at the same point, leading to hysteretic effects.

The existence of the discontinuous transition when $\lambda$ changes, accompanied by the hysteretic effects, is verified by
the simulation of a sample in Fig. 9, where $\alpha = 4$ and $T = 5$. It is interesting to observe a third state with
intermediate $q$ and $\epsilon_g$. The existence of such intermediate states is not uncommon in simulations, although
transitions between the poor and good generalization states are mostly direct, as predicted by the theory. Considering the
stability condition (25) for the parameters used in Fig. 9, we find that the perturbative cavity solution is stable in the
good generalization phase, but unstable in the poor one. This implies that multiple metastable states may exist in the poor
generalization phase, contributing to the cascading transition observed in Fig. 9.
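
A hysteresis experiment of this kind can be sketched as follows. This is only an illustrative protocol, not the procedure used for the reported simulations: it assumes a training energy of the form $E = \frac{1}{2}\sum_\mu [O_\mu - f(x_\mu)]^2 + \frac{\lambda}{2}\sum_j J_j^2$ with a sigmoid $f$, Gaussian inputs, Gaussian noise of strength $T$ added to the teacher activation, and plain gradient descent warm-started from the previous value of $\lambda$. All function names and step sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_noisy_examples(n_inputs, alpha, noise_t, rng):
    """Teacher-generated examples; the noise model (Gaussian noise of
    strength T on the teacher activation) is an assumption of this sketch."""
    n_examples = int(round(alpha * n_inputs))
    teacher = rng.standard_normal(n_inputs)
    teacher *= np.sqrt(n_inputs) / np.linalg.norm(teacher)  # |B|^2 = N
    inputs = rng.standard_normal((n_examples, n_inputs))
    clean = inputs @ teacher / np.sqrt(n_inputs)
    outputs = sigmoid(clean + noise_t * rng.standard_normal(n_examples))
    return teacher, inputs, outputs


def energy(weights, inputs, outputs, lam):
    """Assumed training energy: squared error plus weight decay."""
    x = inputs @ weights / np.sqrt(weights.size)
    return 0.5 * np.sum((outputs - sigmoid(x)) ** 2) + 0.5 * lam * np.sum(weights ** 2)


def relax(weights, inputs, outputs, lam, steps=2000, rate=0.2):
    """Plain gradient descent on the assumed energy."""
    n = weights.size
    for _ in range(steps):
        fx = sigmoid(inputs @ weights / np.sqrt(n))
        grad = -inputs.T @ ((outputs - fx) * fx * (1.0 - fx)) / np.sqrt(n) + lam * weights
        weights = weights - rate * grad
    return weights


def sweep(inputs, outputs, lams, weights):
    """Relax at each lambda in turn, warm-starting from the previous solution,
    and record q = |J|^2 / N."""
    history = []
    for lam in lams:
        weights = relax(weights, inputs, outputs, lam)
        history.append((lam, np.mean(weights ** 2)))
    return weights, history
```

Running `sweep` over an increasing list of $\lambda$ values and then over the reversed list, and comparing $q$ at equal $\lambda$, is one way to look for the hysteresis loop; whether it appears depends on the sample, the step size and the number of relaxation steps.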

Similarly, discontinuous transitions occur when $\alpha$ increases for a given $\lambda$. The learning curves for different
weight decay strengths are plotted in Fig. 10(a) for $T = 1$ and in Fig. 10(b) for $T = 5$. One sees that in both cases, the
student may even learn worse with more training examples if the training examples are not sufficient. Only after sufficient
examples are fed to the student will $\epsilon_g$ decrease asymptotically on increasing $\alpha$. For smaller weight decay,
there is a discontinuous transition from a poor to a good generalization state at a critical example size
$\alpha_c(\lambda, T)$. Discontinuous transitions on changing $\alpha$ and $\lambda$ are also observed in the high temperature
limit in multilayer networks learning clean
examples [E5[. 

Sample averaged simulations for $T = 1$ and $\lambda = 0.0001$, shown in Fig. 11(a), agree satisfactorily with the theory
on both sides of the bump. However, the theory predicts a relatively abrupt change of $\epsilon_g$ for $\alpha$ around 1.6,
which is not observed in the simulation. This discrepancy may be partly due to finite size effects, but we cannot preclude
that effects of the rough energy landscape (RSB) also contribute.

This discrepancy between theory and sample averaged simulations is also observed at $\lambda = 0.001$ and $T = 5$, as
shown in Fig. 11(b). Here we see that discontinuous transitions exist for large noise temperature $T$. Hysteretic effects
are shown by the different values of the transition points in the upward and downward directions of changing $\alpha$,
namely 4.84 and 4.29 respectively. The theoretical prediction of $\alpha_c(\lambda, T)$ is obtained in Fig. 12 from the
intersection of the energy curves of the branches of poor and good generalization states. However, this prediction of
$\alpha_c(\lambda, T) = 5.95$ is higher than the position of the hysteresis. Again, we attribute the discrepancy to finite
size effects and the rough energy landscape.

We can interpret the effects of a rough energy landscape from the comparison between theoretical and simulation
results. As shown in Fig. 11(b), the correction due to a rough energy landscape is minor for small $\alpha$. Although
a band gap exists in the activation distribution, the statistical weight of the outlying bands is very small, as
shown in Fig. 6. When the size of the training set increases, the increasing weight of the outlying bands, as shown
in Fig. 6, implies stronger effects of rough energy landscapes, which may account for the lowering of the critical $\alpha$
of the discontinuous transition in simulations when compared with the prediction of a smooth energy landscape. The
smooth ansatz is stable for the branch of the good generalization state in Fig. 12, but unstable for the poor one. Hence
the introduction of the roughening effects will modify the energy curve of the poor generalization state, while that
of the good one remains unchanged, thus shifting the position of the crossing point. The lowering of the $\alpha$ value in
simulations implies that the energy of the poor generalization state is higher when we change from a smooth picture
to a rough one. This is consistent with previous results that RSB increases the energy of similar perceptrons with
discrete outputs |l9| , |2f| . 

The full phase diagram is drawn in Fig. 13 for a given noise temperature $T$. Above and below the thermodynamic
transition line, line a, the perceptron is in the good and poor generalization phases respectively. Line a ends at the
critical point P, where $\alpha = \alpha^*(T)$. The values of $q$ and $\epsilon_g$ change discontinuously when the global
parameters move across line a, but continuously when they move around point P without crossing line a. The difference
between the continuous and discontinuous transitions around point P is illustrated in Fig. 11(a-b). A discontinuous
transition across line a below P would look like Fig. 11(b), whereas a bumpy crossover, rather than a sharp transition,
would take place above P, as shown in Fig. 11(a). Cuspy behavior similar to that in Fig. 11(a) is also found in linear
networks learning unrealizable tasks [13,14]. Further above P, the bump smoothes out and the position of $\alpha$ with
maximum $\epsilon_g$ shifts towards 1, as shown in Fig. 10 for increasing $\lambda$. The position of the maximum also depends on
the noise temperature $T$. For small values of $T$, the maximum stays near $\alpha = 1$, but for larger noise, the maximum
could move to higher values of $\alpha$, which implies that more examples are required for the student to learn the essence
of the teacher's rule when the noise is stronger.

Line b denotes the stability line separating the regimes of smooth and rough energy landscapes. The rough regime
covers the entire region to the left of the stability line as well as the entire poor generalization phase below line a.
Here the position of line a is estimated assuming a smooth energy landscape. Simulations such as those in Fig. 11(b)
indicate that the effects of rough energy landscapes may shift its position leftwards. The boundary line between gapped
and ungapped regions is effectively indistinguishable from the stability line at $T = 1$. Line c is the spinodal line of
the poor generalization phase, where $\lambda = \lambda_p(\alpha, T)$. The poor generalization phase is metastable in the
shaded region bounded by lines c and a. Similarly, line d is the spinodal line of the good generalization phase, where
$\lambda = \lambda_g(\alpha, T)$, with the good generalization phase being metastable between lines d and a. When $\lambda$
approaches zero, the abscissa of line d approaches $\alpha_0(T)$. Both lines c and d are computed in the smooth ansatz
only, with roughening effects neglected.

It is interesting to consider the change of learning strategy in different regions of Fig. 13. In the region bounded
by lines c and d, more than one learning strategy competes, corresponding to different local minima in energy. To the
left of line b, all states adopt learning strategies which sacrifice a fraction of examples, but those with large $q$
(poor generalization as shown in Fig. 12) sacrifice a significantly larger fraction. To the right of line b, the
competition takes place between states with large $q$, which sacrifice a fraction of examples, and those with small $q$
(good generalization as shown in Fig. 12), which use a more even-handed strategy with no band gaps separating the
activations. Around the phase transition line a, the globally minimal state switches, on increasing $\alpha$, from one with
a sacrificial strategy to a more even-handed one. This discontinuous change in learning strategies is illustrated in
Figs. 6(a-d) and 6(e), where the phase transition line a is crossed on increasing $\alpha$ for a given $\lambda$.

Outside the region with multiple states, the magnitude $q$ of the student weight vector decreases above line c (since
weight decay becomes strong) or to the left of line d (since examples are not enough). Hence the weight vector is
not flexible enough to allow for multiple strategies. In general, the fraction of sacrificed examples is smaller in this 
region. This reduces the difference between the strategies of sacrificing and not sacrificing the examples. As a result, 
all states to the left of line b learn with a single sacrificial strategy and to the right of it with a single even-handed 
strategy. 

VI. CONCLUSION AND REMARKS 

We have studied the supervised learning of noisy examples in nonlinear and differentiable perceptrons using
the cavity method. The existence of band gaps in the activation distribution is demonstrated and is attributed to
the competition of conflicting information inherent in noisy data, and the nonlinearity of the student perceptron.
Activations corresponding to preferred or sacrificed examples during learning are separated by band gaps. The band
gap in the activation distribution is an indication of the extent of information competition and the roughness of the
energy landscape, corresponding to the effects of RSB in the replica approach. The more prominent the band gaps, the more
significant the effects of rough energy landscapes. When both the noise and the training set are large, there is a phase
transition in the student perceptron from a poor generalization state with a long weight vector to a good generalization
state with a short weight vector. The phase transition is accompanied by a change in the learning strategy across the
phase line from sacrificial to even-handed. We present the phase diagram of this system, together with the boundaries
of the gapped regime and of the metastable region. The relation between band gaps and the picture of a rough energy
landscape was discussed in a previous study [g9|. Here we further show where this consideration is most necessary. 

We remark that the preferential or sacrificial effects are common in many other learning systems, such as multilayer 
perceptrons |2^] and weight pruning networks |33| . They create metastable states which cause the hysteretic behavior 
as shown in our simulations (see Figs. 9 and 11). The presence of metastable states prevents the convergence of the
dynamical learning process to the ground state. Hence it is an important issue in the practical implementation of
learning dynamics.

We have illustrated that the cavity method can be used to analyze systems laden with complex information, yielding 
predictions identical to the replica method, yet providing a more physical interpretation. It can be applied to other 
systems such as Support Vector Machines (SVM) when examples are noisy and insufficient [34]. SVM learning of
clean examples has recently been studied using the replica theory [35]. However, since the functional form of the
energy is different, band gaps may not be present. Nevertheless, a cavity analysis of SVMs could offer new valuable 
insights. 

APPENDIX A: THE CAVITY ACTIVATION AND LOCAL SUSCEPTIBILITY

From Eqs. (f|) and (||) and the definitions of to and xo, we obtain 

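$$x_0 - t_0 = \frac{1}{\lambda}(O_0 - f_0)f_0' + \frac{1}{\lambda N}\sum_{j\mu}\left[(O_\mu - f_\mu^0)(f_\mu^0)' - (O_\mu - f_\mu)f_\mu'\right]\xi_j^\mu\xi_j^0. \tag{A1}$$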

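Expanding the last term to first order, and assuming that $x_\mu$ is a well defined function of $t_\mu$, we arrive at

$$\begin{aligned}
x_0 - t_0 = {} & \frac{1}{\lambda}(O_0 - f_0)f_0' + \frac{1}{\lambda N^{3/2}}\sum_{\mu}\sum_{j\neq k}\left[-(f_\mu')^2 + (O_\mu - f_\mu)f_\mu''\right]\frac{dx_\mu}{dt_\mu}\left(J_k^{0\backslash\mu} - J_k^{\backslash\mu}\right)\xi_k^\mu\xi_j^\mu\xi_j^0 \\
& + \frac{1}{\lambda N^{3/2}}\sum_{j\mu}\left[-(f_\mu')^2 + (O_\mu - f_\mu)f_\mu''\right]\frac{dx_\mu}{dt_\mu}\left(J_j^{0\backslash\mu} - J_j^{\backslash\mu}\right)(\xi_j^\mu)^2\xi_j^0,
\end{aligned} \tag{A2}$$

where $J_k^{0\backslash\mu}$ and $J_k^{\backslash\mu}$ denote the student weights trained with training sets without example $\mu$,
and respectively, with and without example 0. Note that
$\sum_{k(\neq j)}(J_k^{0\backslash\mu} - J_k^{\backslash\mu})\xi_k^\mu/\sqrt{N} \approx t_\mu^0 - t_\mu \sim O(N^{-1/2})$ and is
uncorrelated with $\xi_j^\mu$. Neglecting the dependence of $[-(f_\mu')^2 + (O_\mu - f_\mu)f_\mu''](dx_\mu/dt_\mu)$ on
$\xi_j^\mu\xi_j^0$, which is of order $N^{-1}$, we conclude that the second term on the right hand side of (A2) is of order
$N^{-1/2}$ and hence negligible. In the last term, $(\xi_j^\mu)^2$ is uncorrelated with
$(J_j^{0\backslash\mu} - J_j^{\backslash\mu})\xi_j^0$ and hence can be replaced by its average value of 1. For the remaining
summation over $j$, $\sum_j(J_j^{0\backslash\mu} - J_j^{\backslash\mu})\xi_j^0/\sqrt{N}$ reduces to
$x_0^{\backslash\mu} - t_0^{\backslash\mu}$. Assuming that the change in the activation difference $x - t$ of examples due to
the removal of example $\mu$ is small, $x_0^{\backslash\mu} - t_0^{\backslash\mu}$ further reduces to $x_0 - t_0$. Thus

$$x_0 - t_0 = \frac{1}{\lambda}(O_0 - f_0)f_0' + \frac{1}{\lambda N}\sum_\mu\left[-(f_\mu')^2 + (O_\mu - f_\mu)f_\mu''\right]\frac{dx_\mu}{dt_\mu}(x_0 - t_0). \tag{A3}$$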
Defining the local susceptibility $\gamma$ by

$$\frac{1}{\gamma} = \lambda + \frac{1}{N}\sum_\mu\left[(f_\mu')^2 - (O_\mu - f_\mu)f_\mu''\right]\frac{dx_\mu}{dt_\mu}, \tag{A4}$$

we arrive at Eq. (9). Applying the same cavity argument to example $\mu$, $t_\mu$ and $x_\mu$ should also be related by
Eq. (9). This simplifies Eq. (A4) to

$$\frac{1}{\gamma} = \lambda + \frac{1}{\gamma N}\sum_\mu\left(\frac{dt_\mu}{dx_\mu} - 1\right)\frac{dx_\mu}{dt_\mu}, \tag{A5}$$

from which Eq. (10) follows.
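
The step from Eq. (A4) to Eq. (A5) rests on differentiating the relation between the cavity and generic activations. A minimal numerical check is sketched below. It assumes that this relation takes the form $t = x - \gamma[O - f(x)]f'(x)$ with the sigmoid $f(x) = [1 + e^{-x}]^{-1}$ of Appendix D; the function names and the parameter values $\gamma = 2$, $O = 0.3$ are illustrative only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cavity_activation(x, gamma, out):
    """Assumed relation t(x) = x - gamma * (O - f(x)) * f'(x)."""
    fx = sigmoid(x)
    return x - gamma * (out - fx) * fx * (1.0 - fx)


def susceptibility_kernel(x, out):
    """(f')^2 - (O - f) f'' for the sigmoid: the bracket in Eq. (A4)."""
    fx = sigmoid(x)
    d1 = fx * (1.0 - fx)          # f'
    d2 = d1 * (1.0 - 2.0 * fx)    # f''
    return d1 ** 2 - (out - fx) * d2


def check_a4_to_a5(gamma=2.0, out=0.3, h=1e-5):
    """Largest mismatch between the summands of (A4) and (A5) on a grid."""
    xs = np.linspace(-3.0, 3.0, 61)
    # dt/dx by central differences, independent of the analytic kernel
    dt_dx = (cavity_activation(xs + h, gamma, out)
             - cavity_activation(xs - h, gamma, out)) / (2.0 * h)
    dx_dt = 1.0 / dt_dx
    summand_a4 = susceptibility_kernel(xs, out) * dx_dt
    summand_a5 = (dt_dx - 1.0) * dx_dt / gamma
    return np.max(np.abs(summand_a4 - summand_a5))
```

The two summands agree up to finite-difference error, which is the content of the substitution $(f')^2 - (O - f)f'' = (dt/dx - 1)/\gamma$.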



APPENDIX B: THE STABILITY CONDITION 



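In obtaining Eq. (A2), the validity of the perturbative expansion in Eq. (A1) is subject to the condition that the
fluctuation $\Delta_J \equiv \sum_j (J_j^0 - J_j)^2$ is finite. Subtracting the steady-state equation for $J_j$ from that for
$J_j^0$ (the pair of equations used to derive Eq. (A1)), multiplying both sides by $J_j^0 - J_j$ and summing over $j$, we
obtain

$$\Delta_J = \frac{1}{\lambda\gamma}(x_0 - t_0)^2 - \frac{1}{\lambda\gamma}\sum_\mu\left(1 - \frac{dx_\mu}{dt_\mu}\right)\frac{dx_\mu}{dt_\mu}(t_\mu^0 - t_\mu)^2, \tag{B1}$$

where Eq. (9) is adopted, and $x_\mu$ is assumed to be a well defined function of $t_\mu$. The factor $(t_\mu^0 - t_\mu)^2$ in
Eq. (B1) can be expanded as
$\sum_{jk}(J_j^{0\backslash\mu} - J_j^{\backslash\mu})(J_k^{0\backslash\mu} - J_k^{\backslash\mu})\xi_j^\mu\xi_k^\mu/N$ and is
correlated with $(1 - dx_\mu/dt_\mu)(dx_\mu/dt_\mu)$ only at order $O(N^{-1})$. Therefore the average over $\mu$ of the
product of the two factors can be replaced by the product of their averages. Since
$(J_j^{0\backslash\mu} - J_j^{\backslash\mu})(J_k^{0\backslash\mu} - J_k^{\backslash\mu})$ are uncorrelated with the examples
$\xi^\mu$, only terms with $j = k$ contribute to the average over $\mu$. Then $N(t_\mu^0 - t_\mu)^2$ becomes
$\Delta_J^{\backslash\mu} \equiv \sum_j (J_j^{0\backslash\mu} - J_j^{\backslash\mu})^2$. Assuming that the change in $\Delta_J$
due to the removal of example $\mu$ is small, this further reduces to $\Delta_J$ and renders (B1) to

$$\Delta_J = \frac{1}{\lambda\gamma}(x_0 - t_0)^2 - \frac{1}{\lambda\gamma N}\sum_\mu\left(1 - \frac{dx_\mu}{dt_\mu}\right)\frac{dx_\mu}{dt_\mu}\,\Delta_J.$$

Using the relation (9) between the generic and cavity activations of example $\mu$, this can be further reduced to Eq. (25).
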
APPENDIX C: THE EFFECTS OF A GAP ON MEAN FIELD EQUATIONS 



When there is a gap in the distribution $P(x|\tilde y)$ of the student activation $x$ for a given teacher output $\tilde y$, the mean field
equations (15), (M) and ( pl[ ) are not exact since x^ is no longer a differentiable function of t M . Nevertheless, we can 
obtain its modified expression from self-consistent considerations.

In Eq. (A1), the summation over $\mu$ now includes different situations depending on the value of the cavity field $t_\mu$.
For those examples with $t_\mu$ and $t_\mu^0$ located on the same side of the gap, the analysis is similar to that in
Appendix A. However, if $t_\mu$ is close to the gap position $t_g$, then when the new example is included in the training
set, the change of cavity activation $\Delta t_\mu = \sum_k (J_k^{0\backslash\mu} - J_k^{\backslash\mu})\xi_k^\mu/\sqrt{N}$ may
give rise to a large value of $(O_\mu - f_\mu^0)(f_\mu^0)' - (O_\mu - f_\mu)f_\mu'$ as the generic activation $x_\mu$ changes
from below $x_<(t_\mu)$ to above $x_>(t_\mu)$ or the reverse. We distinguish the following cases to calculate the summation
in Eq. (A1).

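The first case corresponds to $t_g - \Delta t_\mu < t_\mu < t_g$. Among the $p$ examples, this happens with probability
$\delta(t_\mu - t_g(\tilde y_\mu))\,\Delta t_\mu\,\Theta(\Delta t_\mu)$. Its contribution to the summation in Eq. (A1) is

$$\sum_{\{\mathrm{Case\ 1}\}} = \sum_{\mu j}\delta(t_\mu - t_g(\tilde y_\mu))\,\Delta t_\mu\,\Theta(\Delta t_\mu) \times \left\{[f(\tilde y_\mu) - f(x_>)]f'(x_>) - [f(\tilde y_\mu) - f(x_<)]f'(x_<)\right\}\xi_j^\mu\xi_j^0. \tag{C1}$$

Similarly, the second case corresponds to $t_g < t_\mu < t_g - \Delta t_\mu$, with the contribution

$$\sum_{\{\mathrm{Case\ 2}\}} = \sum_{\mu j}\delta(t_\mu - t_g(\tilde y_\mu))\,(-\Delta t_\mu)\,\Theta(-\Delta t_\mu) \times \left\{[f(\tilde y_\mu) - f(x_<)]f'(x_<) - [f(\tilde y_\mu) - f(x_>)]f'(x_>)\right\}\xi_j^\mu\xi_j^0. \tag{C2}$$

Combining them together, we have the total contribution from the gap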

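$$\sum_{\{\mathrm{Gap}\}} = p\,(x_0 - t_0)\int dy\,P(y)\int d\tilde y\,P(\tilde y|y)\int dt\,P(t|y)\,\delta(t - t_g(\tilde y)) \times \left\{[f(\tilde y) - f(x_>(t,\tilde y))]f'(x_>(t,\tilde y)) - [f(\tilde y) - f(x_<(t,\tilde y))]f'(x_<(t,\tilde y))\right\}. \tag{C3}$$

Simplifying the integrals, we have

$$\sum_{\{\mathrm{Gap}\}} = \frac{p\,(x_0 - t_0)}{\sqrt{2\pi\left(q - \frac{R^2}{1+T^2}\right)}}\int Du\,\exp\left[-\frac{\left(t_g - \frac{Ru}{\sqrt{1+T^2}}\right)^2}{2\left(q - \frac{R^2}{1+T^2}\right)}\right] \times \left\{[f(\tilde y) - f(x_>(t_g,\tilde y))]f'(x_>(t_g,\tilde y)) - [f(\tilde y) - f(x_<(t_g,\tilde y))]f'(x_<(t_g,\tilde y))\right\}, \tag{C4}$$

with $\tilde y = \sqrt{1+T^2}\,u$. Therefore, we obtain the self-consistent equation (30) for $\gamma$ and the function $t(x)$ in Eq. (|j),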
where x is related to u and v by Eq. (Eq). The positions of band gap t g , x < and x > are determined using the Maxwell's 
construction discussed in Sec. IV. Following Eqs. (j3(j), @ and (Eg), we get the equation of R with extra terms ( |3l| ) 
and equation of g without extra term (J2l|), after elaborate work on integrating by parts. 
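
The determination of $t_g$, $x_<$ and $x_>$ by Maxwell's construction can be sketched numerically as follows. The sketch assumes the relation $t(x) = x - \gamma[O - f(x)]f'(x)$ between cavity and generic activations, the sigmoid $f$ of Appendix D, and a hypothetical single-example function $g(x; t) = (x - t)^2/(2\gamma) + \frac{1}{2}[O - f(x)]^2$. Since $\partial g/\partial x = [t(x) - t]/\gamma$, the equal-area rule for the curve $t(x)$ is equivalent to equal values of $g$ on the two outer branches. It also assumes that $t(x)$ has a single decreasing stretch. The function names and grid settings are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cavity_from_generic(x, gamma, out):
    """Assumed relation t(x) = x - gamma * (O - f(x)) * f'(x)."""
    fx = sigmoid(x)
    return x - gamma * (out - fx) * fx * (1.0 - fx)


def local_energy(x, t, gamma, out):
    """Hypothetical g(x; t); its stationary points in x satisfy t(x) = t."""
    return (x - t) ** 2 / (2.0 * gamma) + 0.5 * (out - sigmoid(x)) ** 2


def maxwell_gap(gamma, out, x_lo=-10.0, x_hi=15.0, n=100001):
    """Return (t_g, x_<, x_>) or None if t(x) is monotonic (no gap)."""
    xs = np.linspace(x_lo, x_hi, n)
    ts = cavity_from_generic(xs, gamma, out)
    falling = np.where(np.diff(ts) < 0)[0]
    if falling.size == 0:
        return None
    i_max, i_min = falling[0], falling[-1] + 1   # local max / min of t(x)
    left_x, left_t = xs[: i_max + 1], ts[: i_max + 1]
    right_x, right_t = xs[i_min:], ts[i_min:]
    t_low, t_high = ts[i_min], ts[i_max]
    for _ in range(60):
        t_mid = 0.5 * (t_low + t_high)
        x_small = np.interp(t_mid, left_t, left_x)    # root on the left branch
        x_large = np.interp(t_mid, right_t, right_x)  # root on the right branch
        diff = (local_energy(x_large, t_mid, gamma, out)
                - local_energy(x_small, t_mid, gamma, out))
        if diff > 0:      # right branch still costlier: move t up
            t_low = t_mid
        else:
            t_high = t_mid
    return t_mid, x_small, x_large
```

With the values quoted in the caption of Fig. 3, `maxwell_gap(20.55, 0.063)` should land close to the band gap [1.18, 2.78] reported there.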



APPENDIX D: CONDITION FOR MAXWELL'S CONSTRUCTION 



For a given teacher output $f(\tilde y)$, $x$ is a multi-valued function of $t$ when $t'(x) < 0$ at the inflection point
$t''(x) = 0$. For the sigmoid function $f(x) = [1 + e^{-x}]^{-1}$, this implies

$$\frac{1}{2\gamma} < \frac{f^2(1-f)^2(1 - 3f + 3f^2)}{1 - 6f + 6f^2}, \tag{D1}$$

where $f$ represents $f(x)$ at the inflection point and can be solved by

$$f(\tilde y) = \frac{f(4 - 15f + 12f^2)}{1 - 6f + 6f^2}. \tag{D2}$$

Note that the conditions (D1) and (D2) are invariant when $f$ and $f(\tilde y)$ transform to $1 - f$ and $1 - f(\tilde y)$
respectively. Hence the regions of Maxwell's construction are symmetric with respect to the point
$(f, f(\tilde y)) = (1/2, 1/2)$. Thus, we obtain the condition of Maxwell's construction (33) if we define $W$ as the
parametric function of $\gamma$ via

$$W = \frac{(2f - 1)(1 - 12f + 12f^2)}{2(1 - 6f + 6f^2)} \tag{D3}$$

and

$$\gamma = \frac{1 - 6f + 6f^2}{2f^2(1-f)^2(1 - 3f + 3f^2)}. \tag{D4}$$

The function $W(\gamma)$ for $\gamma > 0$ is plotted in Fig. 2.
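
The parametric definition of $W(\gamma)$ can be evaluated numerically as a rough check. The sketch below takes Eqs. (D3) and (D4) in the forms $W = (2f-1)(1-12f+12f^2)/[2(1-6f+6f^2)]$ and $\gamma = (1-6f+6f^2)/[2f^2(1-f)^2(1-3f+3f^2)]$, and assumes the branch $0 < f < (3-\sqrt{3})/6$, on which $\gamma$ decreases monotonically from infinity to zero. The function names are hypothetical.

```python
import numpy as np


def gamma_of_f(f):
    """Eq. (D4), written in terms of s = f (1 - f)."""
    s = f * (1.0 - f)
    return (1.0 - 6.0 * s) / (2.0 * s ** 2 * (1.0 - 3.0 * s))


def w_of_f(f):
    """Eq. (D3), written in terms of s = f (1 - f)."""
    s = f * (1.0 - f)
    return (2.0 * f - 1.0) * (1.0 - 12.0 * s) / (2.0 * (1.0 - 6.0 * s))


def w_of_gamma(gamma, tol=1e-12):
    """Invert gamma(f) by bisection on the assumed branch, then return (W, f)."""
    lo, hi = 1e-9, (3.0 - np.sqrt(3.0)) / 6.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_of_f(mid) > gamma:   # gamma too large: f must increase
            lo = mid
        else:
            hi = mid
    f = 0.5 * (lo + hi)
    return w_of_f(f), f
```

For $\gamma = 20.55$ this gives $W \approx 0.364$, so that $1/2 \pm W$ reproduces the bounds 0.136 and 0.864 quoted in the caption of Fig. 3. On this branch $W$ becomes negative at large $\gamma$.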

FIG. 1. The Maxwell's construction to determine the position of band gap $t_g$. In the figure, the areas of the two shaded
regions equal to each other. The labels of the points are to be compared with the lines in Fig. H. 



FIG. 2. The function $W$ of $\gamma$ defined by Eq. (D3) and Eq. (D4), which is introduced to determine the condition of band
gap occurrence.

FIG. 3. The occurrence of band gap and preferential learning when $\gamma = 20.55$. The picture is symmetric with respect to
the point (1/2, 1/2). For intermediate teacher output $0.136 < f(\tilde y) < 0.864$, no band gaps exist. For related
discussions in Figs. 4, 5 and Table I when $\alpha = 3$ and $\lambda = 0.002$, the present value of $\gamma$ corresponds to
$T = 2.5$, and the choice of $\tilde y = -\sqrt{1 + T^2}$ corresponds to the line of $f(\tilde y) = 0.063$, which cuts the
boundaries of the shaded region at $x = 1.18$ and $x = 2.78$, indicating a band gap of $P(x|\tilde y)$ at [1.18, 2.78].

FIG. 4. Regimes of the existence of gapped activation distribution and the regimes of unstable states for different noise
levels. Below the solid lines, the perturbative cavity solutions are unstable, and below the dotted line, band gaps will
appear in the activation distribution. The point ($\alpha = 3$, $\lambda = 0.002$) is denoted by a star. The shaded regions
indicate the existence of discontinuous phase transitions to be discussed in Sec. V. (For $T = 0.1$, the shaded region is
too small to be shown.)

FIG. 5. Theoretical and simulation results of student activation distributions, indicated by solid and dashed lines
respectively, when $\alpha = 3$, $\lambda = 0.002$ (denoted by a star in Fig. 4) and $\tilde y = -\sqrt{1 + T^2}$ for different
noise temperatures. The arrow in (b) shows the position of a pseudo gap and the arrows in (c) show the band gap
[1.18, 2.78] from the theoretical prediction in Fig. 3.

FIG. 6. The theoretical prediction of the student activation distributions at $T = 5$ and $\lambda = 0.001$ for different
sizes $\alpha$ of the training set, where $\tilde y = 0$. When $\alpha = 5$, the system has two states; (d) and (e) are the
distributions when the system is in the poor and good generalization states respectively.

FIG. 7. The dependence of the local susceptibility $\gamma$ on the weight decay strength $\lambda$ for different $\alpha$ at
$T = 1$. All curves approach $\lambda = 0$ when $\gamma$ goes to infinity.

FIG. 8. The dependence of the generalization error $\epsilon_g$ on the weight decay strength $\lambda$ for different sizes
$\alpha$ of the training set when the noise temperature $T = 1$. The inset illustrates the three branches of the energy
curve for $\alpha = 1.9$.

FIG. 9. Variations of $\epsilon_g$, $\epsilon_t$, $R$ and $q$ of a sample of simulation when the weight decay strength
$\lambda$ changes at $T = 5$, $\alpha = 4$ and $N = 50$. The arrows in (a) denote the routes of changing.



FIG. 10. The learning curves for different weight decay strengths: (a) $T = 1$; (b) $T = 5$.



FIG. 11. Simulation versus theoretical results for the generalization error on changing $\alpha$. (a) $T = 1$ and
$\lambda = 0.0001$; the simulation result is the average over 14 samples. (b) $T = 5$ and $\lambda = 0.001$; the simulation
result is the average over 20 samples on decreasing and increasing $\alpha$. In all simulations, the number of input nodes
is $N = 150$.

FIG. 12. The energy $E/N$ (solid line) and the magnitude of student vector $q$ (dotted line) versus the size $\alpha$ of the
training set, for $T = 5$ and $\lambda = 0.001$. The phase transition point is determined from the crossover of the two
branches of the energy curve, $\alpha_c = 5.95$, and the spinodal point of the good generalization state is at
$\alpha_{sp} = 4.4$.



FIG. 13. The phase diagram for nonlinear perceptrons learning noisy examples when $T = 1$. P is the critical point with
$\alpha = \alpha^* = 1.65$. Line d terminates at $\alpha = \alpha_0 = 1.737$ when $\lambda$ approaches zero.



TABLE I. The comparison of macroscopic parameters and errors obtained from theory and simulation (in brackets) for
different $T$ when $\alpha = 3$ and $\lambda = 0.002$.

$T$    $\gamma$   $R$              $q$              $\epsilon_t$     $\epsilon_g$
0.1    11.1       0.963 (0.961)    0.933 (0.932)    0.018 (0.017)    0.027 (0.027)
2.0    14.9       0.796 (0.795)    2.211 (2.196)    0.236 (0.236)    0.376 (0.375)
2.5    20.6       0.837 (0.822)    3.816 (3.574)    0.260 (0.260)    0.437 (0.431)
5.0    80.1       0.920 (0.848)    23.05 (16.34)    0.275 (0.301)    0.577 (0.570)